3 Validation, Demultiplexing, and Quality Control

Amanda Birmingham, CCBB, UCSD (abirmingham@ucsd.edu)

Validating the Mapping File
Running Demultiplexing and Quality Control
Interpreting Demultiplexing Results
Interpreting Quality Control Results

Related Notebooks:

1 Introducing 16S Microbiome Primary Analysis
2 Setting Up Starcluster for QIIME
4 OTU Picking and Rarefaction Depth Selection
5 Analyzing Core Diversity

Validating the Mapping File

Because all downstream QIIME analyses depend on the meta-data in the mapping file (see Required Inputs section in Introduction to 16S Microbiome Primary Analysis), the first step in analysis is validating that the mapping file is formatted correctly. QIIME provdes a script for this purpose called validate_mapping_file.py; full details are available at http://qiime.org/scripts/validate_mapping_file.html . In brief, the script takes in a path to the mapping file to validate and a directory path to which to write outputs:

validate_mapping_file.py -m [mapping_file_path] -o [directory_path]

as shown in this example:

validate_mapping_file.py -m inputs/925_merged_prep_mapping2.txt -o mapping_validation/

Note that this script cannot be run in parallel, so it will use only one node of the cluster; however, since it is a fast step this doesn't present a problem.

The script assesses the mapping file for various errors and then writes a log file, an html file, and a corrected mapping file to the output directory. If there are any issues found, you will also see a message like this:

Errors and/or warnings detected in mapping file.  Please check the log and html file for details.

The HTML output and the log output both contain largely the same information, and I have not found it necessary to check both. When working on a cluster terminal, the log is easier to use; a simple call like

head mapping_validation/925_merged_prep_mapping.log

will show the top of the log file, which lists the errors and warnings found, as in this example:

# Errors and warnings are written as a tab separated columns, with the first column showing the error or warning, and the second column contains the location of the error or warning, written as row,column, where 0,0 is the top left header item (SampleID).  Problems not specific to a particular data cell will be listed as having 'no location'.
Errors -----------------------------
Warnings ---------------------------
Invalid characters found in "This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair ampli\_es the region 533\_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA\_id:470367).] The reverse PCR primer is barcoded with a 12-base errorcorrecting Golay code to facilitate multiplexing of up to ξ1,500 samples per lane, and both PCR primers contain sequencer adapter regions. The three sequencing primers include two for reading in from each end of the amplicon and a third for reading the barcode. Because of technical limitations at the sequencing facility, only part of the barcode was sequenced, so we were unable to exploit the error-correcting properties fully; however, even with partial barcodes we were able to resolve the samples, demonstrating the robustness of the approach. It is important to note that this primer collection allows for sequencing of paired-end reads, but the downstream data analyses are not yet capable of supporting paired-end reads. Our results illustrate interesting and correlated patterns based on analysis of the unpaired reads (i.e., \_ and \_ diversity evaluations based on the 5\_ only and 3\_ only reads independently achieve similar results, suggesting that 100 bases in this region of the 16S gene can allow for successful screening and comparison of microbial communities). The reads generated from these PCR primers are both identi\_ed as \_recommended��� regions by Liu et al. (15). Polymerase Chain Reaction. Sample preparation was performed similarly to that described by Costello et al. (1). Brie\_y, each sample was ampli\_ed in triplicate, combined, and cleaned using the MO BIO 96 htp PCR clean up kit. PCR reactions contained 13 \_L MO BIO PCR water, 10 \_L 5 Prime Hot Master Mix, 0.5 \_L each of the forward and reverse primers (10 \_M \_nal concentration), and 1.0 \_L genomic DNA. Reactions were held at 94 C for 3 min to denature the DNA, with ampli\_cation proceeding for 35 cycles at 94 ξC for 45 s, 50 ξC for 60 s, and 72 ξC for 90 s; a \_nal extension of 10 min at 72 ξC was added to ensure complete ampli\_cation. Cleaned amplicons were quanti\_ed using Picogreen dsDNA reagent in 10 mM Tris buffer (pH 8.0). A composite sample for sequencing was created by combining equimolar ratios of amplicons from the individual samples, followed by gel puri\_cation and ethanol precipitation to remove any remaining contaminants and PCR artifacts." 1,9
Invalid characters found in "This analysis was done as in Caporaso et al 2011 Genome research. The PCR primers (F515/R806) were developed against the V4 region of the 16S rRNA (both bacteria and archaea), which we determined would yield optimal community clustering with reads of this length using a procedure similar to that of ref. 15. [For reference, this primer pair ampli\_es the region 533\_786 in the Escherichia coli strain 83972 sequence (greengenes accession no. prokMSA\_id:470367).] The reverse PCR primer is barcoded with a 12-base errorcorrecting Golay code to facilitate multiplexing of up to ξ1,500 samples per lane, and both PCR primers contain sequencer adapter regions. The three sequencing primers include two for reading in from each end of the amplicon and a third for reading the barcode. Because of technical limitations at the sequencing facility, only part of the barcode was sequenced, so we were unable to exploit the error-correcting properties fully; however, even with partial barcodes we were able to resolve the samples, demonstrating the robustness of the approach. It is important to note that this primer collection allows for sequencing of paired-end reads, but the downstream data analyses are not yet capable of supporting paired-end reads. Our results illustrate interesting and correlated patterns based on analysis of the unpaired reads (i.e., \_ and \_

Even if there are no errors, it is a good idea to skim the types of warnings created in order to see what they were ... in my experience, invalid characters in text fields (like protocols) and incorrect column ordering (specifically, not having the description column as the last column in the file) are the most common, and are successfully dealt with by the automated correction. As mentioned above, the HTML file contains much the same information but in a more visual format:

Errors require you as the analyst to fix whatever problem was found (and then, of course, rerun the validation). If only warnings are detected, chances are that you do not make any changes to the mapping file yourself but can simply use the corrected mapping produced by the script. However, do remember to reference the corrected mapping file in all subsequent script calls rather than the original one, since the validation script does not overwrite or remove the original, issue-containing file!

Table of Contents

Running Demultiplexing and Quality Control

As described in the Background and Required Inputs sections of the 1 Introduction to 16S Microbiome Primary Analysis document, this SOP covers the analysis of 16S data sequenced using the three-read technique and employing Golay error-correcting barcodes. In this case, QIIME makes demultiplexing, or assigning each read to its source sample based on the barcode attached to it, very simple. Start by ensuring that you indeed have a fastq file containing the "real" sequences (i.e., the reads from the read primer) and a second fastq file containing the barcode reads (from the index primer); if you do not, chances are that the data was not generated with the three-read approach and will require more processing. Assuming you do, unzip any zipped fastqs with the gunzip command, as in this example:

gunzip *.fastq.gz

Now all that is necessary is a single call to split_libraries_fastq.py, which, as its name implies, splits the reads from the sequenced library (collection of DNA fragments) up into one set for each sample, based on the barcodes assigned to each sample. It may be something of a surprise to learn that it also encapsulates all the quality control checking in the QIIME pipeline, which is done automatically "under the hood"! Quality checks performed by this method include:

Truncating reads after a given number of consecutive low quality base calls (default = 3)
- Number not reported in log
Filtering out reads lacking at least a given fraction of consecutive high quality base calls in the read length (default = 0.75)
- Number reported on the "Read too short after quality truncation" line in log
Filtering out reads with more than a given number of 'N's, after quality trimming (default = 0)
- Number reported on the "Count of N characters exceeds limit" line in log
Filtering out reads with phred scores at or below a given number (default = 3)
- Note that this is equivalent to saying that only reads with phred scores of 4 or greater are retained
- Number not reported in log
Filtering out reads with more than a given number of errors in their barcodes (default = 1.5)
- Number reported on the "Barcode errors exceed max" line in log

The call is of the format

split_libraries_fastq.py -m [mapping file path] -i [sequences fastq file path] -b [barcode fastq file path] -o [output directory path]

as shown in this example:

split_libraries_fastq.py -m /data/yellowstone_gradients/mapping_validation/925_merged_prep_mapping_corrected.txt -i /data/yellowstone_gradients/s_6_1_sequences.fastq -b /data/yellowstone_gradients/s_6_1_sequences_barcodes.fastq -o /data/library_split/

As always, remember to use the corrected mapping file! Additionally, note that this step is also not parallelizable.

Table of Contents

Interpreting Demultiplexing Results

After a successful run, the output directory will contain a log file (split_library_log.txt), a tab-delimited text file listing the distribution of reads by length (histograms.txt), and--as the main goal--a fasta file containing the read sequences labeled by the sample from which they came (seqs.fna). The log file is a good place to start in assessing the results; running head on it will show the main findings, since they are reported at the top:

Input file paths
Mapping filepath: /data/yellowstone_gradients//mapping_validation/925_prep_98_qiime_20141124-091849_corrected.txt (md5: a4048567752a64487db5265bbf1043e6)
Sequence read filepath: /data/yellowstone_gradients//s_6_1_sequences.fastq (md5: 572afdd3b02343c64fee37b52f1123c2)
Barcode read filepath: /data/yellowstone_gradients//s_6_1_sequences_barcodes.fastq (md5: b9824c34b990751a15a5f7c1ce9f9354)

Quality filter results
Total number of input sequences: 15034373
Barcode not in mapping file: 6991839
Read too short after quality truncation: 658828
Count of N characters exceeds limit: 21
Illumina quality digit = 0: 0
Barcode errors exceed max: 888033

Result summary (after quality filtering)
Median sequence length: 90.00
925.in5x    46567
925.in4z    41714
925.io4x    41362
925.in5y    39376
925.sa4x    39169
925.im4z    38288
925.sc2x    37312
925.in4y    34720
925.sa2y    34691
925.io1x    34424
925.sb1z    34196

In this example, we see the total number of input sequences followed by summary statistics about reads lost to various quality filters. The large number with "Barcode not in mapping file" is not a concern in this case; as the name implies, it represents reads whose barcode is not listed in the mapping file for this experiment, and it is common for this to be large when sequencing libraries from multiple experiments are combined in the same (usually HiSeq) sequencing run. Simply, these are reads from other experiments that were sequenced in the same run. However, if you have reason to believe these are a concern given a particular experimental/run design, you can pass the --retain_unassigned_reads switch to split_libraries_fastq.py to have these included in the sequence output with the sample id "Unassigned" so that you can examine them.

Table of Contents

Interpreting Quality Control Results

After these summaries, the log contains a list of each sample and the number of quality-check-passing reads assigned to it, as well as (down at the bottom) the total number of reads that passed quality-checking. The histogram file also contains count information, but segregated by read length bin instead of sample id:

Length  Count
69.0    455092
79.0    742271
89.0    5298289
99.0    0
--

As usual, it is difficult to set firm thresholds for what is "acceptable" quality and what is "unacceptable". However, evidence suggests that useful determinations can be made for even relatively low numbers (10s to 100s) of sequences per sample, as shown in this figure from "Direct sequencing of the human microbiome readily reveals community differences" (Kuczynski J, Costello EK, Nemergut DR, et al. Genome Biology. 2010;11(5):210, available online at http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2898070/):

**Figure 2** ![](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2898070/bin/gb-2010-11-5-210-2.jpg) **Variation in human body habitats within and between people**. **(a)** The full dataset (approximately 1,500 sequences per sample); <**>(b)** the dataset sampled at only 10 sequences per sample, showing the same pattern; **(c) ** the relationship between sequencing depth and the PERMANOVA component of variation. The amount of variation explained by the factors plateaus at relatively shallow sequencing depths. Note that the proportion of variation captured by differences between the samples (that is, residual variation) is still highest despite the explanatory values of the three factors examined. **(d)** Effect size determines the number of sequences required for sample identification. Each point in the figure represents a specific sample selected from a pair of body sites, and the number of sequences required to correctly distinguish which site the sample originated from. The point is colored according to the two body sites under consideration, the center's color represents the broad category the selected sample originated from, the border color represents the other broad category under consideration. Many body sites share the same broad category, and thus some points have the same border and center coloring. Red, external ear canal; yellow, hair; green, oral cavity; blue, gut; magenta, skin; gray, nostril. ns, not significant.

Given this, if most of the samples have 100 or more reads after quality filtering, analysis is likely feasible.

Table of Contents



In [ ]: